Add node debug tool with tests #38

harche · 2025-10-01T19:23:35Z

Tested with Claude Code:

> what is the version of cgroup the node harpatilcluster000a7-m9vgd-worker-a-6jz6f is using? 

⏺ I'll check the cgroup version for the specified node.

⏺ kubernetes-mcp-server - Nodes: Debug Exec (MCP)(node: "harpatilcluster000a7-m9vgd-worker-a-6jz6f",
                                                 command: ["stat","-fc","%T","/sys/fs/cgroup/"])
  ⎿  cgroup2fs                                                                                       

⏺ cgroup v2
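
For reference, a minimal sketch of how the `stat -fc %T /sys/fs/cgroup/` output above maps to a cgroup version. The function name `cgroupVersion` is hypothetical and not part of this PR; the mapping assumes the usual convention that `cgroup2fs` means cgroup v2 and `tmpfs` means cgroup v1.

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupVersion is a hypothetical helper: it maps the filesystem type printed
// by `stat -fc %T /sys/fs/cgroup/` to a cgroup version label.
func cgroupVersion(statOutput string) string {
	switch strings.TrimSpace(statOutput) {
	case "cgroup2fs":
		return "v2" // unified hierarchy
	case "tmpfs":
		return "v1" // legacy hierarchy mounted under a tmpfs
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(cgroupVersion("cgroup2fs\n")) // prints "v2"
}
```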

openshift-ci · 2025-10-01T19:24:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: harche
Once this PR has been reviewed and has the lgtm label, please assign ardaguclu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

harche · 2025-10-01T19:34:48Z

/hold

harche · 2025-10-01T19:39:23Z

/hold cancel

harche · 2025-10-01T22:56:10Z

/hold for fixing CI issues.

ardaguclu · 2025-10-02T04:02:51Z

I'll defer the review to:

@manusa @Cali0707 @matzew

harche · 2025-10-02T11:26:41Z

Failures in the linter and security jobs are not related to the changes in this PR. The linter failures are being addressed in #39, and the fixes for the security failures are in #40.

Cali0707

Thanks for working on this @harche

The code looks good overall, left a few comments throughout

I'll test this on an OpenShift cluster tmrw

Cali0707 · 2025-10-06T19:49:18Z

pkg/toolsets/core/nodes.go

+	internalk8s "github.com/containers/kubernetes-mcp-server/pkg/kubernetes"
+)
+
+func initNodes(_ internalk8s.Openshift) []api.ServerTool {


Nit: maybe we can drop the internalk8s.Openshift parameter here since we don't need it? For the initXYZ methods, there's no requirement on this parameter existing - it seems to be present in only some of them

Suggested change

func initNodes(_ internalk8s.Openshift) []api.ServerTool {

func initNodes() []api.ServerTool {

Addressed, thanks.

Cali0707 · 2025-10-06T19:56:41Z

pkg/kubernetes/nodes.go

+//
+// When namespace is empty, the configured namespace (or "default" if none) is used. When image is empty the
+// default debug image is used. Timeout controls how long we wait for the pod to complete.
+func (k *Kubernetes) NodesDebugExec(


Would it be possible to split this function up a little bit? IMO it is getting quite large and is responsible for too much

Maybe we can create functions that:

- create the debug pod
- poll for debug completion
- retrieve the logs

Addressed, thanks.
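
A rough sketch of the split suggested above, assuming a small `podRunner` interface as a stand-in for the Kubernetes pod client so it runs without a cluster. `podRunner`, `fakeRunner`, `demo`, and the helper names are all hypothetical; the actual refactor in this PR may be structured differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// podRunner is a hypothetical stand-in for the Kubernetes pod client; it is
// not an interface from this PR.
type podRunner interface {
	Create(ctx context.Context, node string, command []string) (string, error)
	Phase(ctx context.Context, podName string) (string, error)
	Logs(ctx context.Context, podName string) (string, error)
	Delete(ctx context.Context, podName string) error
}

// createDebugPod starts the debug pod on the node and returns its name.
func createDebugPod(ctx context.Context, r podRunner, node string, command []string) (string, error) {
	if node == "" {
		return "", errors.New("node name is required")
	}
	return r.Create(ctx, node, command)
}

// waitForCompletion polls the pod phase until it is terminal or ctx expires.
func waitForCompletion(ctx context.Context, r podRunner, podName string, interval time.Duration) error {
	for {
		phase, err := r.Phase(ctx, podName)
		if err != nil {
			return err
		}
		if phase == "Succeeded" || phase == "Failed" {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for debug pod %s to complete: %w", podName, ctx.Err())
		case <-time.After(interval):
		}
	}
}

// retrieveLogs fetches the output of the debug container.
func retrieveLogs(ctx context.Context, r podRunner, podName string) (string, error) {
	return r.Logs(ctx, podName)
}

// nodesDebugExec only orchestrates the three steps and the cleanup.
func nodesDebugExec(ctx context.Context, r podRunner, node string, command []string, timeout time.Duration) (string, error) {
	podName, err := createDebugPod(ctx, r, node, command)
	if err != nil {
		return "", err
	}
	// Delete with a fresh context so cleanup still runs after a timeout.
	defer func() { _ = r.Delete(context.Background(), podName) }()

	waitCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := waitForCompletion(waitCtx, r, podName, 5*time.Millisecond); err != nil {
		return "", err
	}
	return retrieveLogs(ctx, r, podName)
}

// fakeRunner is an in-memory podRunner used only to exercise the sketch.
type fakeRunner struct {
	pendingPolls int // Phase reports "Running" this many times, then "Succeeded"
	deleted      bool
}

func (f *fakeRunner) Create(_ context.Context, node string, _ []string) (string, error) {
	return "node-debug-" + node, nil
}
func (f *fakeRunner) Phase(context.Context, string) (string, error) {
	if f.pendingPolls > 0 {
		f.pendingPolls--
		return "Running", nil
	}
	return "Succeeded", nil
}
func (f *fakeRunner) Logs(context.Context, string) (string, error) { return "cgroup2fs\n", nil }
func (f *fakeRunner) Delete(context.Context, string) error        { f.deleted = true; return nil }

// demo runs the flow against the fake and reports output and cleanup state.
func demo(node string, pendingPolls int) (out string, deleted bool, err error) {
	r := &fakeRunner{pendingPolls: pendingPolls}
	out, err = nodesDebugExec(context.Background(), r, node, []string{"stat", "-fc", "%T", "/sys/fs/cgroup/"}, 200*time.Millisecond)
	return out, r.deleted, err
}

func main() {
	out, deleted, err := demo("worker-0", 2)
	fmt.Printf("output=%q deleted=%v err=%v\n", out, deleted, err)
}
```

Keeping each step behind its own function also makes it easier to test the timeout and cleanup paths in isolation, as the fake runner does here.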

Cali0707 · 2025-10-06T20:00:10Z

pkg/kubernetes/nodes.go

+	// nodeDebugContainerName is the name used for the debug container, matching oc debug defaults.
+	nodeDebugContainerName = "debug"
+	// defaultNodeDebugTimeout is the maximum time to wait for the debug pod to finish executing.
+	defaultNodeDebugTimeout = 5 * time.Minute


We may want to lower this timeout as by default this will significantly exceed the client tool call connection timeout: https://github.com/modelcontextprotocol/typescript-sdk/blob/e0de0829019a4eab7af29c05f9a7ec13364f121e/src/shared/protocol.ts#L60

We probably also want to add support for progress notifications to be sent to the client for longer running tool calls like this one (cc @mrunalp @ardaguclu @matzew @manusa )

Reduced to 1 min, thanks.
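
A minimal sketch of how the optional `timeout_seconds` argument could be resolved against the reduced default. `resolveTimeout` and `assumedClientTimeout` are hypothetical names, and the 60-second client value is an assumption about the default referenced in the linked SDK file, not something confirmed in this thread.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// defaultNodeDebugTimeout reflects the value after the reduction to 1 minute.
	defaultNodeDebugTimeout = 1 * time.Minute
	// assumedClientTimeout is an ASSUMPTION about the client's default tool
	// call timeout; adjust it to whatever the client actually uses.
	assumedClientTimeout = 60 * time.Second
)

// resolveTimeout (hypothetical) turns the optional timeout_seconds argument
// into a duration and reports whether it would outlive the assumed client timeout.
func resolveTimeout(timeoutSeconds int) (d time.Duration, exceedsClient bool) {
	d = defaultNodeDebugTimeout
	if timeoutSeconds > 0 {
		d = time.Duration(timeoutSeconds) * time.Second
	}
	return d, d > assumedClientTimeout
}

func main() {
	for _, s := range []int{0, 30, 300} {
		d, exceeds := resolveTimeout(s)
		fmt.Printf("timeout_seconds=%d -> %s (exceeds client timeout: %v)\n", s, d, exceeds)
	}
}
```

A caller could log a warning, or send progress notifications once those are supported, whenever `exceedsClient` is true.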

Cali0707 · 2025-10-06T20:01:00Z

pkg/kubernetes/nodes.go

+		grace := int64(0)
+		_ = podsClient.Delete(deleteCtx, created.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace})


nit: let's use ptr.To here like we do elsewhere

The refactored file at pkg/ocp/nodes_debug.go now uses ptr.To. Thanks.
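
For readers unfamiliar with the helper, a self-contained sketch of the pattern. The local `to` function mirrors what `ptr.To` from `k8s.io/utils/ptr` does, and `deleteOptions` is a stand-in for `metav1.DeleteOptions` so the example compiles without Kubernetes dependencies; both are illustrative only.

```go
package main

import "fmt"

// to mirrors the behavior of ptr.To from k8s.io/utils/ptr: it returns a
// pointer to a copy of the given value.
func to[T any](v T) *T { return &v }

// deleteOptions is a stand-in for metav1.DeleteOptions with only the field
// used in the diff above.
type deleteOptions struct {
	GracePeriodSeconds *int64
}

// immediateDelete builds options for a zero grace period without the
// temporary `grace` variable from the original code.
func immediateDelete() deleteOptions {
	return deleteOptions{GracePeriodSeconds: to(int64(0))}
}

func main() {
	opts := immediateDelete()
	fmt.Println(*opts.GracePeriodSeconds) // prints 0
}
```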

matzew · 2025-10-07T05:45:11Z

pkg/ocp/nodes_debug.go

+	defaultNodeDebugTimeout = 5 * time.Minute
+)
+
+// NodesDebugExec mimics `oc debug node/<name> -- <command...>` by creating a privileged pod on the target


Since this is OCP-specific, would it make sense to group this functionality into a pkg/ocp package?

File pkg/ocp/nodes_debug.go is part of ocp package.

Hold on, I will move everything this PR adds from pkg/kubernetes to pkg/ocp so that future rebases avoid conflicts.

Cali0707

@harche would it be possible to improve the error messages we provide?

When running this server with Claude Code, Claude was able to call the tool, but it frequently got errors such as:

⏺ ocp-debug - Nodes: Debug Exec (MCP)(node: "ip-10-0-112-253.us-east-2.compute.internal", command: ["systemctl","status","kubelet"])
  ⎿  Error: command exited with code 1 (Error)

I'm not sure if there is a way to get more info about what went wrong, but with the current lack of error messages it was hard for the agent to figure out what went wrong and how to fix it

ardaguclu · 2025-10-08T04:06:16Z

pkg/kubernetes/kubernetes.go

 type Kubernetes struct {
-	manager *Manager
+	manager          *Manager
+	podClientFactory func(namespace string) (corev1client.PodInterface, error)


Wouldn't these changes cause divergence from upstream? I think that in this repository we shouldn't allow any changes touching packages such as kubernetes/, mcp/, etc.

My suggestion is to add this upstream first and simply use it downstream. The downstream repository should touch only the ocp/ directory.

yes, working on it, #38 (comment)

harche · 2025-10-15T19:42:55Z

/test test

Cali0707 · 2025-10-15T19:50:44Z

pkg/toolsets/core/toolset.go

 	return slices.Concat(
 		initEvents(),
 		initNamespaces(o),


IMO - let's not add any openshift-specific tools to the core toolset. I'm worried this may lead to conflicts in the future if we make any upstream changes.

Instead, let's create a new openshift specific toolset (maybe openshift-core) that will hold all the openshift only tools

@Cali0707 thanks for that feedback. Created a new pkg/toolsets/openshift/ package with the openshift-core toolset, and moved nodes_debug_exec from core to openshift-core.

harche · 2025-10-15T20:16:44Z

> @harche would it be possible to improve the error messages we provide?
>
> When running this server with Claude Code, Claude was able to call the tool, but it frequently got errors such as:
>
> ⏺ ocp-debug - Nodes: Debug Exec (MCP)(node: "ip-10-0-112-253.us-east-2.compute.internal", command: ["systemctl","status","kubelet"])
>   ⎿  Error: command exited with code 1 (Error)
>
> I'm not sure if there is a way to get more info about what went wrong, but with the current lack of error messages it was hard for the agent to figure out what went wrong and how to fix it

We return whatever exit code and error message we get from executing that command.

openshift-ci · 2025-10-15T20:22:21Z

@harche: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

harche · 2025-10-15T20:27:10Z

/hold cancel

Cali0707 · 2025-10-21T14:09:51Z

pkg/ocp/nodes_debug.go

+	if terminated != nil {
+		if terminated.ExitCode != 0 {
+			errMsg := fmt.Sprintf("command exited with code %d", terminated.ExitCode)
+			if terminated.Reason != "" {
+				errMsg = fmt.Sprintf("%s (%s)", errMsg, terminated.Reason)
+			}
+			if terminated.Message != "" {
+				errMsg = fmt.Sprintf("%s: %s", errMsg, terminated.Message)
+			}
+			return logs, errors.New(errMsg)
+		}
+		return logs, nil


Can you include the logs in the error message here? If you look at how the eventual result to the client is created, the logs are not returned when there is an error, which makes it hard to know what went wrong.

https://github.com/containers/kubernetes-mcp-server/blob/c3bc991237ab9f4aba7f642616f43cd970d27a53/pkg/mcp/mcp.go#L205-L225
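
One possible shape for this, as a sketch: append the captured output to the error so the client still sees it on failure. `terminatedState` is a stand-in for the fields of the terminated container state used in the diff, and `commandError` and the `Output:` label are hypothetical choices, not the change that landed.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// terminatedState is a stand-in for the fields of the terminated container
// state that the diff above reads.
type terminatedState struct {
	ExitCode int32
	Reason   string
	Message  string
}

// commandError (hypothetical) builds the same message as the diff above and
// appends the captured logs so they are not lost when an error is returned.
func commandError(t terminatedState, logs string) error {
	if t.ExitCode == 0 {
		return nil
	}
	msg := fmt.Sprintf("command exited with code %d", t.ExitCode)
	if t.Reason != "" {
		msg = fmt.Sprintf("%s (%s)", msg, t.Reason)
	}
	if t.Message != "" {
		msg = fmt.Sprintf("%s: %s", msg, t.Message)
	}
	if trimmed := strings.TrimSpace(logs); trimmed != "" {
		msg = fmt.Sprintf("%s\nOutput:\n%s", msg, trimmed)
	}
	return errors.New(msg)
}

func main() {
	// The log text here is a made-up placeholder, not real command output.
	err := commandError(terminatedState{ExitCode: 1, Reason: "Error"}, "example output line\n")
	fmt.Println(err)
}
```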

Cali0707 · 2025-10-21T14:15:51Z

pkg/ocp/nodes_debug.go

+		select {
+		case <-pollCtx.Done():
+			return nil, nil, "", fmt.Errorf("timed out waiting for debug pod %s to complete: %w", podName, pollCtx.Err())
+		default:
+		}


Let's remove this select, as there is one at the bottom of the loop that checks the same condition.

Cali0707 · 2025-10-21T14:16:18Z

pkg/ocp/nodes_debug.go

+		select {
+		case <-pollCtx.Done():
+			return nil, nil, "", fmt.Errorf("timed out waiting for debug pod %s to complete: %w", podName, pollCtx.Err())
+		case <-ticker.C:
+		}


Could you add a comment here explaining that this is to wait before the next check? This confused me when I first ran into it at the bottom of the loop
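
A sketch of what the loop might look like with a single select and the requested comment. `waitUntilDone` and `pollAttempts` are hypothetical names, and the `check` callback stands in for fetching the pod and inspecting its status.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitUntilDone (hypothetical) calls check until it reports completion,
// returns an error, or ctx expires.
func waitUntilDone(ctx context.Context, interval time.Duration, check func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		// Wait before the next check. This is also the only place where the
		// loop observes cancellation or the timeout, so no second select is
		// needed at the top of the loop.
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for debug pod to complete: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// pollAttempts exercises the loop: check succeeds on the given attempt, and
// the whole wait is bounded by timeoutMillis.
func pollAttempts(succeedAfter int, timeoutMillis int) (attempts int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	err = waitUntilDone(ctx, 5*time.Millisecond, func() (bool, error) {
		attempts++
		return attempts >= succeedAfter, nil
	})
	return attempts, err
}

func main() {
	attempts, err := pollAttempts(3, 1000)
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}
```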

Cali0707 · 2025-10-21T14:20:00Z

pkg/mcp/modules.go

 import _ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/config"
 import _ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/core"
 import _ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/helm"
+import _ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/openshift"


Could we move this to an openshift_modules.go to avoid merge conflicts with upstream if we introduce more toolsets there?

This is a good point. Keeping it separate will ease the catching up with upstream on this branch.

Cali0707 · 2025-10-21T14:21:49Z

pkg/config/config.go

 	return &StaticConfig{
 		ListOutput: "table",
-		Toolsets:   []string{"core", "config", "helm"},
+		Toolsets:   []string{"core", "config", "helm", "openshift-core"},


I'm not sure we want to add this to the default config here

On the one hand, it makes sense to have the openshift tools enabled by default in the openshift fork

However, this introduces lots of changes to the other config/toolset tests as now they all expect the openshift-core toolset to be there

I'm worried this will lead to a load of conflicts whenever we make changes to those upstream...

Any thoughts @matzew @manusa @mrunalp @ardaguclu ?

I guess I am fine w/ just running with explicit config for these extra tools, since it is a fork?

yeah - my only concern is a lot of the tests rely on this default config - see the large diff on the tests

This may cause a lot of diffs if we make changes to those upstream
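
To make the trade-off concrete, a small sketch of the explicit-config option: the defaults stay as they are upstream, and `openshift-core` is enabled only when it is listed explicitly. `effectiveToolsets` and `toolsetEnabled` are hypothetical helpers, not functions from this repository.

```go
package main

import (
	"fmt"
	"slices"
)

// defaultToolsets matches the unchanged upstream default from the diff above.
var defaultToolsets = []string{"core", "config", "helm"}

// effectiveToolsets (hypothetical) returns the configured toolsets, falling
// back to the defaults when nothing is configured.
func effectiveToolsets(configured []string) []string {
	if len(configured) == 0 {
		return slices.Clone(defaultToolsets)
	}
	return slices.Clone(configured)
}

// toolsetEnabled (hypothetical) reports whether a toolset is active for the
// given configuration.
func toolsetEnabled(configured []string, name string) bool {
	return slices.Contains(effectiveToolsets(configured), name)
}

func main() {
	explicit := []string{"core", "config", "helm", "openshift-core"}
	fmt.Println(toolsetEnabled(nil, "openshift-core"))      // prints false
	fmt.Println(toolsetEnabled(explicit, "openshift-core")) // prints true
}
```

With this approach the tests that rely on the default config would not need to change, at the cost of requiring users of the fork to opt in.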

Cali0707 · 2025-10-21T14:22:48Z

README.md

+- **nodes_debug_exec** - Run commands on an OpenShift node using a privileged debug pod with comprehensive troubleshooting utilities. The debug pod uses the UBI9 toolbox image which includes: systemd tools (systemctl, journalctl), networking tools (ss, ip, ping, traceroute, nmap), process tools (ps, top, lsof, strace), file system tools (find, tar, rsync), and debugging tools (gdb). Commands execute in a chroot of the host filesystem, providing full access to node-level diagnostics. Output is truncated to the most recent 100 lines, so prefer filters like grep when expecting large logs.
+  - `command` (`array`) **(required)** - Command to execute on the node via chroot. All standard debugging utilities are available including systemctl, journalctl, ss, ip, ping, traceroute, nmap, ps, top, lsof, strace, find, tar, rsync, gdb, and more. Provide each argument as a separate array item (e.g. ['systemctl', 'status', 'kubelet'] or ['journalctl', '-u', 'kubelet', '--since', '1 hour ago']).
+  - `image` (`string`) - Container image to use for the debug pod (optional). Defaults to registry.access.redhat.com/ubi9/toolbox:latest which provides comprehensive debugging and troubleshooting utilities.
+  - `namespace` (`string`) - Namespace to create the temporary debug pod in (optional, defaults to the current namespace or 'default').
+  - `node` (`string`) **(required)** - Name of the node to debug (e.g. worker-0).
+  - `timeout_seconds` (`integer`) - Maximum time to wait for the command to complete before timing out (optional, defaults to 300 seconds).


Can you re-generate this? This tool isn't in the core toolset anymore 😄

matzew · 2025-10-21T14:23:24Z

README.md


 - **projects_list** - List all the OpenShift projects in the current cluster

+- **nodes_debug_exec** - Run commands on an OpenShift node using a privileged debug pod with comprehensive troubleshooting utilities. The debug pod uses the UBI9 toolbox image which includes: systemd tools (systemctl, journalctl), networking tools (ss, ip, ping, traceroute, nmap), process tools (ps, top, lsof, strace), file system tools (find, tar, rsync), and debugging tools (gdb). Commands execute in a chroot of the host filesystem, providing full access to node-level diagnostics. Output is truncated to the most recent 100 lines, so prefer filters like grep when expecting large logs.


eventually we might want to group the tools into "kube" and "ocp" tools?

(Not relevant now, but generally wondering CC @manusa)

harche mentioned this pull request Oct 1, 2025

[RFE] node-specific tools #36

Open

openshift-ci bot requested a review from ardaguclu October 1, 2025 19:24

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2025

harche force-pushed the main branch from dacc89d to f8e4a08 Compare October 1, 2025 19:38

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2025

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2025

harche changed the title ~~Add node debug tool with tests~~ WIP: Add node debug tool with tests Oct 2, 2025

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2025

harche force-pushed the main branch from f8e4a08 to fb7eed5 Compare October 2, 2025 11:25

harche changed the title ~~WIP: Add node debug tool with tests~~ Add node debug tool with tests Oct 2, 2025

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2025

Cali0707 reviewed Oct 6, 2025


matzew reviewed Oct 7, 2025


openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025

harche force-pushed the main branch from fb7eed5 to fc61e2e Compare October 7, 2025 15:46

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025

harche force-pushed the main branch from fc61e2e to 9aef094 Compare October 7, 2025 15:55

Cali0707 reviewed Oct 7, 2025


ardaguclu reviewed Oct 8, 2025


harche force-pushed the main branch from 9aef094 to 9183526 Compare October 8, 2025 12:40

swghosh mentioned this pull request Oct 10, 2025

MG-34: Add oc cli like must-gather collection to plan_mustgather tool #51

Open

harche force-pushed the main branch 2 times, most recently from c99af16 to 694501c Compare October 15, 2025 19:35

Cali0707 reviewed Oct 15, 2025


Add nodes_debug_exec tool in pkg/ocp package

9f4a546

harche force-pushed the main branch from 694501c to 9f4a546 Compare October 15, 2025 20:10

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 15, 2025

Cali0707 reviewed Oct 21, 2025


matzew reviewed Oct 21, 2025


